


Institutional Archive of the Naval Postgraduate School 





Calhoun: The NPS Institutional Archive 
DSpace Repository 


Theses and Dissertations 1. Thesis and Dissertation Collection, all items 


1960 


A derivation of the basic statistic of 
information theory. 


Kolar, Robert P. 


Monterey, California: U.S. Naval Postgraduate School 
http://ndl.handle.net/10945/11931 


This publication is a work of the U.S. Government as defined in Title 17, United 
States Code, Section 101. Copyright protection is not available for this work in the 
United States. 


Downloaded from NPS Archive: Calhoun 


Calhoun is the Naval Postgraduate School's public access digital repository for
research materials and institutional publications created by the NPS community.
Calhoun is named for Professor of Mathematics Guy K. Calhoun, NPS's first
appointed, and published, scholarly author.

Dudley Knox Library / Naval Postgraduate School
411 Dyer Road / 1 University Circle 
Monterey, California USA 93943 





http://www.nps.edu/library 



















UNITED STATES 
NAVAL POSTGRADUATE SCHOOL 





THESIS 


A DERIVATION OF THE BASIC STATISTIC 
OF INFORMATION THEORY 


by 


Robert P. Kolar 










A DERIVATION OF THE BASIC STATISTIC 


OF INFORMATION THEORY 


by 
Robert P. Kolar


Lieutenant Commander, United States Navy 


Submitted in partial fulfillment of 
the requirements for the degree of 


MASTER OF SCIENCE 
United States Naval Postgraduate School 
Monterey, California 


1960










A DERIVATION OF THE BASIC STATISTIC 
OF INFORMATION THEORY 
by 


Robert P. Kolar 


This work is accepted as fulfilling 
the thesis requirements for the degree of 
MASTER OF SCIENCE 
from the 


United States Naval Postgraduate School 













ABSTRACT 


Information Theory is applicable to a number of fields.
The basic statistic of Information Theory, $-\sum p_i \log p_i$, is
derived for the discrete case using Bayes's rule for the
probability of causes. Various properties of this function
are derived and discussed.

The writer wishes to express his appreciation for the 
assistance and encouragement given him by Professor 
Randolph Church of the U. S. Naval Postgraduate School in 


this investigation. 
1. Introduction


In 1948 Claude Shannon published his now famous paper
entitled "The Mathematical Theory of Communication" [13],
later reprinted with a paper by Warren Weaver as reference
[12], in which he defined a communication system as being
composed of an information source, a transmitter, a channel,
a receiver, and the destination. The fundamental problem in
such a system is the reproduction at the destination, either
exactly or approximately, of a message selected at the
information source. His approach was statistical in nature; that is,
he did not consider the semantic aspects of the message but
rather that the message is one of a set of possible messages.
Among other things he showed that under certain conditions it
is possible to encode the transmitted information so that it
would be received with an arbitrarily small frequency of error.

Basic to his arguments is a concept known in thermodynamics
as entropy, which can be described as a measure of the
amount of disorder in a physical system. This implies that a
message, in some sense, represents a certain amount of disorder
and that there is a measure of this disorder which can
be obtained and used.

Kullback [7] recently published a book in which he
uses the same basic concept to provide a unifying background
for the study of the testing of statistical hypotheses.
Bagno [1] uses Shannon's theorems to arrive at some startling
conclusions relative to economic theory. Miller's [9]
article presents a short discussion on the use of some of these
concepts in the field of psychology. Thus Information Theory
apparently has a wide field of applications.

What is information? One dictionary¹ lists among its
several definitions "... knowledge communicated or received
concerning some fact ...". We say that knowledge is certainty
and that Information Theory is the study of the ability of
systems to transmit certainty or, equivalently, of the change
in an observer's state of uncertainty when he has performed
experiments on a situation and drawn conclusions (gained
knowledge) from the results.

In the following pages we will develop the basic sta- 
tistic of Information Theory and show that it is a logical 
and appealing measure of information in the above sense. The 
basis of the development will be Probability Theory and 


specifically an application of Bayes's rule. 


¹ The New Century Dictionary, Appleton-Century-Crofts, 1948.
2. Bayes's Rule and Information

Let us perform a simple experiment and see if there is 
a pattern or characteristic which can be exploited to obtain 
a statistic which relates a situation prior to one or more 
observations to the situation after the observations. 

We can represent the situation by A, where A is composed
of m mutually exclusive and exhaustive events, $a_1, a_2, \ldots, a_m$, and
the outcomes of an observation by B, where B is composed of n
mutually exclusive and exhaustive events $b_1, b_2, \ldots, b_n$. An
event $a_i$ occurs with probability $p(a_i)$, $0 \le p(a_i) \le 1$,
$\sum_i p(a_i) = 1$; an event $b_j$ occurs with probability $p(b_j)$,
$0 \le p(b_j) \le 1$, $\sum_j p(b_j) = 1$.

Consider that we have been given two nickels, told that
one is fair and the other biased so the probability of heads
appearing when tossed is 1/4, and asked to determine which is
the fair nickel. Let the experiment consist of choosing one
of the two nickels at random, tossing it twice and noting
the outcome of both tosses. Based on the results of this
experiment we are to state the probability that the chosen
nickel is fair.

Let $a_1$ denote the event: the fair nickel is chosen, and
$a_2$ denote the event: the biased nickel is chosen. Let $b_1$
denote the event: the nickel comes up heads, and $b_2$ denote
the event: the nickel comes up tails.

Since we make a random choice, the probability that the
fair nickel is chosen, $p(a_1)$, is equal to 1/2, and the
probability that the biased nickel is chosen, $p(a_2)$, is also equal
to 1/2.

The conditional probability of a specific outcome will
be denoted by $p(b_j/a_i)$, where

$$p(b_1/a_1) = 1/2,$$
$$p(b_2/a_1) = 1/2,$$
$$p(b_1/a_2) = 1/4,$$
and
$$p(b_2/a_2) = 3/4.$$


Considering now the first toss of the coin, from the
definition of conditional probability¹ we can write

$$p(a_i b_j) = p(b_j/a_i)\,p(a_i) \tag{1}$$

where $a_i b_j$ denotes the joint occurrence of event $a_i$ and
outcome $b_j$. When the $a_i$ are mutually exclusive and exhaustive
events,

$$p(b_j) = \sum_i p(b_j/a_i)\,p(a_i). \tag{2}$$


The conditional probability of event $a_i$ is the ratio of the
probability of the joint occurrence of both events to the
probability of the outcome, or in symbols,

$$p(a_i/b_j) = \frac{p(a_i b_j)}{p(b_j)}. \tag{3}$$

Substituting (1) and (2) into (3),

$$p(a_i/b_j) = \frac{p(b_j/a_i)\,p(a_i)}{\sum_i p(b_j/a_i)\,p(a_i)}, \tag{4}$$

which is Bayes's rule for the probability of causes if we
identify the event $a_i$ with the cause and outcome $b_j$ with the
effect.

¹ W. Feller, An Introduction to Probability Theory and Its
Applications, Second Edition, John Wiley & Sons Inc., Ch. V, 1957.

Using the given values we can calculate the conditional
probability of $a_i$, which in the context of Bayes's rule is
known as the aposteriori probability of event $a_i$, as
contrasted with $p(a_i)$, the apriori probability of event $a_i$.
Displayed in tabular form,

    p(a_i/b_j)                         (5)

    b_j       a_1       a_2
    1         2/3       1/3
    2         2/5       3/5





Thus after one toss of the nickel, the aposteriori probability
of having chosen the fair nickel is 2/3 if we had observed
heads, and 2/5 if we had observed tails. If the sequence,
selection and toss, were repeated a large number of
times, the aposteriori probabilities represent the frequency¹
with which the observer would be correct if he associated a
given choice with a given outcome.

¹ Loc. cit.
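The one-toss calculation above can be checked with a short sketch. The dictionary layout and the function name `posterior` are illustrative assumptions, not part of the thesis; exact fractions are used so the results can be compared directly with the values quoted in the text.

```python
from fractions import Fraction

# Apriori probabilities p(a_i): a1 = fair nickel, a2 = biased nickel.
prior = {"a1": Fraction(1, 2), "a2": Fraction(1, 2)}

# Conditional probabilities p(b_j/a_i): b1 = heads, b2 = tails.
likelihood = {
    "a1": {"b1": Fraction(1, 2), "b2": Fraction(1, 2)},
    "a2": {"b1": Fraction(1, 4), "b2": Fraction(3, 4)},
}

def posterior(outcome):
    """Aposteriori probabilities p(a_i/b_j) for one observed outcome b_j."""
    joint = {a: likelihood[a][outcome] * prior[a] for a in prior}  # equation (1)
    p_b = sum(joint.values())                                      # equation (2)
    return {a: joint[a] / p_b for a in prior}                      # equations (3), (4)

for b in ("b1", "b2"):
    print(b, posterior(b))
```

With these inputs the fair-nickel entries come out as 2/3 for heads and 2/5 for tails, in agreement with the text.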
We will now toss the coin for the second time. Defining
the joint outcome $b_j b_k$, $j, k = 1, 2$, as the pair of observations
made in the two tosses, we can extend (4) to

$$p(a_i/b_j b_k) = \frac{p(b_j b_k/a_i)\,p(a_i)}{\sum_i p(b_j b_k/a_i)\,p(a_i)}. \tag{6}$$
The conditional probability of the outcome of a single toss
remains the same once a coin is chosen, therefore

$$p(b_j/a_i) = p(b_k/a_i), \quad j = k. \tag{7}$$

Since the tosses are independent,

$$p(b_j b_k/a_i) = p(b_j/a_i)\,p(b_k/a_i). \tag{8}$$

Substituting (8) into (6),

$$p(a_i/b_j b_k) = \frac{p(b_j/a_i)\,p(b_k/a_i)\,p(a_i)}{\sum_i p(b_j/a_i)\,p(b_k/a_i)\,p(a_i)}. \tag{9}$$
We are now in a position to calculate the aposteriori probability
of event $a_i$ after two observations. Again using the
given values and exhibiting the results in a table,

    p(a_i/b_j b_k)                     (10)

    b_j b_k   a_1       a_2
    11        4/5       1/5
    12        4/7       3/7
    21        4/7       3/7
    22        4/13      9/13

Thus if after two tosses we had observed heads-heads, the
aposteriori probability that the coin is fair would be 4/5.

Notice that for each additional toss the size of the
table required to describe the possible outcomes is doubled.
If instead of this almost trivial example we had set m and n
equal to 50, the table required to describe the situation
would be enormous. It would be desirable to have a simpler
way of presenting this data.

The denominator in (9) can be written in a manner analogous
to (2) as $p(b_j b_k)$. Multiplying (9) by

$$\frac{\sum_i p(b_j/a_i)\,p(a_i)}{\sum_i p(b_j/a_i)\,p(a_i)}$$

and rearranging,

$$p(a_i/b_j b_k) = \frac{p(b_k/a_i)\,\sum_i p(b_j/a_i)\,p(a_i)}{\sum_i p(b_j/a_i)\,p(b_k/a_i)\,p(a_i)} \cdot \frac{p(b_j/a_i)}{\sum_i p(b_j/a_i)\,p(a_i)} \cdot p(a_i). \tag{11}$$

Simplifying the summations,

$$p(a_i/b_j b_k) = \frac{p(b_k/a_i)\,p(b_j)}{p(b_j b_k)} \cdot \frac{p(b_j/a_i)}{p(b_j)} \cdot p(a_i). \tag{12}$$





Notice that the expression for the aposteriori probability
of event $a_i$ after the first observation, (4), appears on the
right side of (11) and is multiplied by a fraction whose value
is determined by the outcome of the second observation.
Thus we can say that the aposteriori probability of event
$a_i$ after the first observation becomes the apriori probability
of event $a_i$ before the second observation. If we represent
the fraction $\frac{p(b_k/a_i)\,p(b_j)}{p(b_j b_k)}$ by $F_2$ and the
fraction $\frac{p(b_j/a_i)}{p(b_j)}$ in (3) and (12) by $F_1$, we can write (3) as

$$p(a_i/b_j) = F_1\,p(a_i) \tag{13}$$

and (12) as

$$p(a_i/b_j b_k) = F_1 F_2\,p(a_i). \tag{14}$$

It is apparent that this could be extended to any number of
observations. The aposteriori probability after say n
observations is the product of n F factors and the apriori
probability for the first observation. This then is the pattern
which we will use.
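The pattern of (13) and (14) can be illustrated by updating the belief one toss at a time and recording the F factor produced by each toss. This is a minimal sketch under the nickel example's values; the names `update` and `sequential_posterior` and the list-based indexing (index 0 for $a_1$ or heads, index 1 for $a_2$ or tails) are assumptions made for illustration.

```python
from fractions import Fraction

prior = [Fraction(1, 2), Fraction(1, 2)]        # p(a_1), p(a_2)
like = [[Fraction(1, 2), Fraction(1, 2)],       # p(b_1/a_1), p(b_2/a_1)
        [Fraction(1, 4), Fraction(3, 4)]]       # p(b_1/a_2), p(b_2/a_2)

def update(belief, j):
    """One Bayes step for outcome index j: returns (F factors, new belief)."""
    # Probability of the outcome under the current belief.
    p_b = sum(like[i][j] * belief[i] for i in range(len(belief)))
    factors = [like[i][j] / p_b for i in range(len(belief))]
    return factors, [f * p for f, p in zip(factors, belief)]

def sequential_posterior(outcomes):
    """Apply the update once per observed outcome, as in (13) and (14)."""
    belief = list(prior)
    all_factors = []
    for j in outcomes:
        factors, belief = update(belief, j)
        all_factors.append(factors)
    return all_factors, belief

factors, post = sequential_posterior([0, 0])    # heads, heads
print(factors, post)
```

For heads-heads the two factors for the fair nickel are 4/3 and 6/5, and their product times the apriori probability 1/2 gives 4/5, the corresponding entry of table (10).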

We will define information as a statistic which measures
the change in an observer's belief as to which event $a_i$,
from a set of mutually exclusive and exhaustive events A, is
the cause of outcome $b_j$ from a mutually exclusive and
exhaustive set of outcomes B, where $\sum_i p(a_i) = 1$, $\sum_j p(b_j) = 1$,
and $0 \le p(a_i),\, p(b_j) \le 1$. We require the following
mathematical properties of this statistic: a. additivity, and
b. dependence on apriori and aposteriori probabilities.
Additivity requires that the total amount of information
obtained from a sequence of observations is the sum of the
amounts obtained from the individual observations.

The F's defined above do relate to apriori and aposteriori
probabilities. The requirement of additivity can be
met by use of the function log F, for

$$\log (F_1 F_2 \cdots F_n) = \log (F_1) + \log (F_2) + \cdots + \log (F_n). \tag{15}$$


Shannon [12] uses base two logarithms, defining the
unit of information as a bit, a contraction of binary digit.
One bit of information corresponds to being informed of the
outcome of a binary equally likely selection, a unit which
is convenient since a relay or flip-flop circuit can store
one bit of information. Kullback [7] uses natural logarithms
since his work involves integration and differentiation,
and others have used base 10.
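As a small numerical illustration of (15) and of the choice of logarithm base, the sketch below uses the two F factors obtained for the fair nickel when heads is observed twice (4/3 and 6/5, computed from the given values of the example). The variable names are assumptions made for illustration.

```python
import math

# F factors for the fair nickel a_1 after heads, then heads.
F1 = 4 / 3   # p(b_1/a_1) / p(b_1) = (1/2) / (3/8)
F2 = 6 / 5   # p(b_1/a_1) p(b_1) / p(b_1 b_1) = (1/2)(3/8) / (5/32)

# Additivity (15): the log of the product equals the sum of the logs.
bits_each = [math.log2(F1), math.log2(F2)]
bits_total = math.log2(F1 * F2)

# The same quantity in other units: natural logarithms and base 10.
nats_total = math.log(F1 * F2)
base10_total = math.log10(F1 * F2)

print(bits_each, bits_total, nats_total, base10_total)
```

The total equals the logarithm of the ratio of the aposteriori probability 4/5 to the apriori probability 1/2; changing the base only rescales the result by a constant.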
3. Uncertainty

We have indicated that a measure of the total change in
belief of an observer is the sum of the logarithms of the F
factors. Let us now look at the information obtained from
the first observation of a sequence, designated as $I_{a_i b_j}$.
From the definition of $F_1$,

$$I_{a_i b_j} = \log \frac{p(b_j/a_i)}{p(b_j)}, \tag{16}$$

and from (1) and (3) we see that we can also represent $I_{a_i b_j}$ as

$$I_{a_i b_j} = \log \frac{p(a_i/b_j)}{p(a_i)} \tag{17}$$

and

$$I_{a_i b_j} = \log \frac{p(a_i b_j)}{p(a_i)\,p(b_j)}. \tag{18}$$


Confining our attention to (17) for the moment, if $p(a_i/b_j)$ is
greater than $p(a_i)$, the information is positive; if less
than $p(a_i)$, the information is negative. Positive information
corresponds to an increase in certainty. If $p(a_i/b_j)$
were equal to one, we would be certain that $a_i$ was the cause
of our observed outcome. We will define the uncertainty of
event $a_i$ as the value of $I_{a_i b_j}$ when $p(a_i/b_j) = 1$; thus the
uncertainty of event $a_i$ is $-\log p(a_i)$. This is the maximum
amount of information which can be obtained concerning $a_i$ in
one observation. We are more concerned however with the
entire set A, so let us first obtain the average uncertainty,
which we will denote by H(A).

$$H(A) = -\sum_i p(a_i) \log p(a_i). \tag{19}$$
i 
H(A) is descriptive of set A or of any other set with the
same number of events and the same probability distribution.
We can also speak of the average uncertainty of observation
set B, and the event-outcome pair set AB, as

$$H(B) = -\sum_j p(b_j) \log p(b_j) \tag{20}$$

and

$$H(AB) = -\sum_i \sum_j p(a_i b_j) \log p(a_i b_j). \tag{21}$$
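For concreteness, the average uncertainties (19), (20) and (21) can be evaluated for the nickel example. This is a sketch only: the helper name `H`, the choice of base two logarithms, and the list layout are assumptions made for illustration.

```python
import math

p_a = [0.5, 0.5]                          # p(a_1), p(a_2)
like = [[0.5, 0.5], [0.25, 0.75]]         # like[i][j] = p(b_j/a_i)

# Joint probabilities p(a_i b_j) by (1) and outcome probabilities p(b_j) by (2).
joint = [[like[i][j] * p_a[i] for j in range(2)] for i in range(2)]
p_b = [sum(joint[i][j] for i in range(2)) for j in range(2)]

def H(probs):
    """Average uncertainty -sum p log2 p in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_A = H(p_a)                                    # equation (19)
H_B = H(p_b)                                    # equation (20)
H_AB = H([p for row in joint for p in row])     # equation (21)
print(H_A, H_B, H_AB)
```

Here H(A) is exactly one bit, since the two nickels are equally likely to be chosen.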


Consider now the average information obtained on a
particular event $a_i$ if we average over all possible outcomes
of the first observation. We will designate this average as
$I_{a_i B_1}$:

$$I_{a_i B_1} = \sum_j p(b_j/a_i)\, I_{a_i b_j} = \sum_j p(b_j/a_i) \log \frac{p(b_j/a_i)}{p(b_j)}. \tag{22}$$


This average is obviously equal to zero when $p(b_j/a_i)$
is equal to $p(b_j)$; recalling from (2) that $p(b_j)$ is equal to
$\sum_i p(b_j/a_i)\,p(a_i)$, we see that this is only possible when
$p(a_i)$ is equal to unity, or when all the $p(b_j/a_i)$ are equal
for a given j. In the first case we would obtain no information
since $a_i$ is the only event which can occur, and in the
second case we would obtain no information since $b_j$ is
independent of $a_i$. We will show that in all other cases $I_{a_i B_1}$
is greater than zero. An inequality from Feinstein [4] can
be written as

$$\sum_j q_j \log p_j \le \sum_j q_j \log q_j \tag{23}$$

for $0 \le q_j,\, p_j \le 1$, $\sum_j p_j = \sum_j q_j = 1$, $j = 1, 2, \ldots, n$, with
equality only when $p_j = q_j$. Separating the logarithm in (22),

$$I_{a_i B_1} = \sum_j p(b_j/a_i) \log p(b_j/a_i) - \sum_j p(b_j/a_i) \log p(b_j) \tag{24}$$

and identifying $p(b_j/a_i)$ with $q_j$, $p(b_j)$ with $p_j$, we see that
the second sum on the right is always less than or equal to
the first sum. Now since the logarithms of numbers less
than one are negative, the difference is always greater than
or equal to zero. Therefore the average information from
one observation on a single event is a positive quantity
corresponding to a decrease in uncertainty, or is zero
corresponding to no change. Thus

$$I_{a_i B_1} \ge 0. \tag{25}$$
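The nonnegativity in (25) can be checked numerically for the nickel example, even though individual terms of the sum in (22) are negative. This is a sketch under the example's values; the function names `info` and `avg_info` are assumptions made for illustration.

```python
import math

p_a = [0.5, 0.5]                          # p(a_1), p(a_2)
like = [[0.5, 0.5], [0.25, 0.75]]         # like[i][j] = p(b_j/a_i)
p_b = [sum(like[i][j] * p_a[i] for i in range(2)) for j in range(2)]   # equation (2)

def info(i, j):
    """Information from a single outcome, equation (16), in bits."""
    return math.log2(like[i][j] / p_b[j])

def avg_info(i):
    """Average information on event a_i over all outcomes, equation (22)."""
    return sum(like[i][j] * info(i, j) for j in range(2) if like[i][j] > 0)

for i in range(2):
    print(i, [info(i, j) for j in range(2)], avg_info(i))
```

For the fair nickel, observing tails gives negative information, yet the average over both outcomes is still positive.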


We will now show that the average information obtained
from a sequence of observations on the same event cannot
exceed the uncertainty of the event. From (3) and (12) we can
write $F_2$ as

$$F_2 = \frac{p(a_i/b_j b_k)}{p(a_i/b_j)} \tag{26}$$

and the average information obtained from the second
observation as

$$I_{a_i B_2} = \sum_j \sum_k p(b_j/a_i)\,p(b_k/a_i) \log \frac{p(a_i/b_j b_k)}{p(a_i/b_j)}. \tag{27}$$
By an argument identical to that used to prove (25) this
average is also greater than or equal to zero. Separating
the logarithms in $I_{a_i B_1}$ and $I_{a_i B_2}$ we can write the sum as

$$I_{a_i B_1} + I_{a_i B_2} = \sum_j p(b_j/a_i) \log p(a_i/b_j) - \sum_j p(b_j/a_i) \log p(a_i)$$
$$\qquad + \sum_j \sum_k p(b_j/a_i)\,p(b_k/a_i) \log p(a_i/b_j b_k)$$
$$\qquad - \sum_j \sum_k p(b_j/a_i)\,p(b_k/a_i) \log p(a_i/b_j). \tag{28}$$

If we now sum the last term on the right over k, we see that
the result is identical to the first term, thus both terms
cancel out, leaving us

$$I_{a_i B_1} + I_{a_i B_2} = \sum_j \sum_k p(b_j/a_i)\,p(b_k/a_i) \log p(a_i/b_j b_k) - \sum_j p(b_j/a_i) \log p(a_i). \tag{29}$$


Similarly, in a sequence of say n observations, the first
term in the expression for the average information from one
observation in the sequence will cancel out with the second
term in the expression for the average information from the
next observation in the sequence. The sum will therefore
consist of two terms such as

$$\sum_{r=1}^{n} I_{a_i B_r} = \sum_j \cdots \sum_s p(b_j/a_i) \cdots p(b_s/a_i) \log p(a_i/b_j \cdots b_s) - \sum_j p(b_j/a_i) \log p(a_i) \tag{30}$$

where $b_j$ is the first outcome and $b_s$ the last outcome. Now
the maximum value which the first term on the right can
attain is zero, corresponding to $p(a_i/b_j \cdots b_s) = 1$, and the
second term is just the uncertainty of event $a_i$. Thus

$$\max \sum_{r=1}^{n} I_{a_i B_r} = -\log p(a_i). \tag{31}$$
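The bound (31) can be illustrated numerically for the nickel example by evaluating the right side of (30) for increasing n. This is a sketch under the assumption of independent tosses, as in (8); the function names are hypothetical and base two logarithms are used.

```python
import math
from itertools import product

p_a = [0.5, 0.5]                          # p(a_1), p(a_2)
like = [[0.5, 0.5], [0.25, 0.75]]         # like[i][j] = p(b_j/a_i)

def posterior(i, seq):
    """p(a_i / b_j b_k ...) for a sequence of outcome indices, extending (9)."""
    weights = [p_a[h] * math.prod(like[h][j] for j in seq) for h in range(len(p_a))]
    return weights[i] / sum(weights)

def total_average_information(i, n):
    """Right side of (30) in bits: first term minus log p(a_i)."""
    first_term = sum(
        math.prod(like[i][j] for j in seq) * math.log2(posterior(i, seq))
        for seq in product(range(len(like[i])), repeat=n)
    )
    return first_term - math.log2(p_a[i])

for n in range(1, 7):
    print(n, total_average_information(0, n), -math.log2(p_a[0]))
```

The total grows with each additional toss but stays below the uncertainty of the event, which is one bit here.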


Returning to the case of a single observation, let us
average $I_{a_i B_1}$ over all events $a_i$ and see what the overall
average, that is, over both events and outcomes, looks like.
By (18) we can write (22) as

$$I_{a_i B_1} = \sum_j p(b_j/a_i) \log \frac{p(a_i b_j)}{p(a_i)\,p(b_j)}. \tag{32}$$

The average of (32) over all events, designated by $I_{AB}$, is

$$I_{AB} = \sum_i \sum_j p(a_i)\,p(b_j/a_i) \log \frac{p(a_i b_j)}{p(a_i)\,p(b_j)}. \tag{33}$$

By (1) and (3) and separating the logarithms,

$$I_{AB} = \sum_i \sum_j p(a_i b_j) \big[ \log p(a_i b_j) - \log p(a_i) - \log p(b_j) \big]. \tag{34}$$

By (19), (20) and (21) we see that

$$I_{AB} = -H(AB) + H(A) + H(B). \tag{35}$$


Since we have established that $I_{a_i B_1}$ is nonnegative, $I_{AB}$ must
also be nonnegative. This implies that

$$H(AB) \le H(A) + H(B), \tag{36}$$

with equality when A and B are independent, or one or both
consist of one event with probability one. We can also write
(33) in the form

$$I_{AB} = -\sum_i p(a_i) \Big[ -\sum_j p(b_j/a_i) \log p(b_j/a_i) + \sum_j p(b_j/a_i) \log p(b_j) \Big]. \tag{37}$$

The first term inside the brackets is similar in form to
(19), (20) and (21) and we will define the uncertainty of
outcome set B, given a specific event $a_i$, as

$$H(B/a_i) = -\sum_j p(b_j/a_i) \log p(b_j/a_i). \tag{38}$$


Rearranging the right side of (37) and substituting (38),

$$I_{AB} = -\sum_i p(a_i)\,H(B/a_i) - \sum_i \sum_j p(a_i)\,p(b_j/a_i) \log p(b_j). \tag{39}$$

Now the first term on the right is the negative of the average
conditional uncertainty of B and the second term, after summing
over i by (2), is the uncertainty of B; thus, denoting the
average conditional uncertainty by H(B/A),

$$I_{AB} = -H(B/A) + H(B). \tag{40}$$

Since $I_{AB}$ is nonnegative,

$$H(B/A) \le H(B), \tag{41}$$

with equality as in (36).

(16) and (17) are symmetric expressions in $a_i$ and $b_j$, so
an equivalent averaging of (17) would yield

$$I_{AB} = -H(A/B) + H(A) \tag{42}$$

and

$$H(A/B) \le H(A). \tag{43}$$
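The three expressions (35), (40) and (42) for the overall average information can be compared numerically on the nickel example. The sketch below assumes base two logarithms; the variable names are illustrative and do not come from the thesis.

```python
import math

p_a = [0.5, 0.5]                          # p(a_1), p(a_2)
like = [[0.5, 0.5], [0.25, 0.75]]         # like[i][j] = p(b_j/a_i)
joint = [[p_a[i] * like[i][j] for j in range(2)] for i in range(2)]   # equation (1)
p_b = [joint[0][j] + joint[1][j] for j in range(2)]                   # equation (2)

def H(probs):
    """Average uncertainty -sum p log2 p in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Overall average information, equation (33).
I_AB = sum(joint[i][j] * math.log2(joint[i][j] / (p_a[i] * p_b[j]))
           for i in range(2) for j in range(2))

H_A, H_B = H(p_a), H(p_b)
H_AB = H([p for row in joint for p in row])
# Average conditional uncertainties H(B/A) and H(A/B).
H_B_given_A = sum(p_a[i] * H(like[i]) for i in range(2))
H_A_given_B = sum(p_b[j] * H([joint[i][j] / p_b[j] for i in range(2)])
                  for j in range(2))

print(I_AB, H_A + H_B - H_AB, H_B - H_B_given_A, H_A - H_A_given_B)
```

All four printed values agree to rounding error, and the inequalities (36), (41) and (43) hold strictly because the outcome of a toss is not independent of the choice of nickel.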







4. Properties of the Uncertainty Statistic

A number of properties were obtained in the previous
section, namely:

$$H(AB) \le H(A) + H(B), \tag{36}$$
$$H(B/A) \le H(B), \tag{41}$$
and
$$H(A/B) \le H(A). \tag{43}$$


We will now consider this function in its own right and
obtain some additional properties. Subtracting (40) from (35)
we find that

$$H(AB) = H(A) + H(B/A), \tag{44}$$

and subtracting (42) from (35),

$$H(AB) = H(B) + H(A/B). \tag{45}$$

Since H is of the form $-\sum_i p_i \log p_i$ we may represent it
as $H(p_1, p_2, \ldots, p_m)$. By the property of the logarithm,

$$\lim_{x \to 0} x \log x = 0,$$

$$H(p_1, p_2, \ldots, p_m, 0) = H(p_1, p_2, \ldots, p_m), \tag{46}$$

which indicates that adding an impossible event to a set does
not change the average uncertainty.

Let us now determine the probability distribution for
which the H function takes on its maximum value. Using the
method of Lagrange multipliers¹ and writing H as a function
of all its arguments, we form

¹ Kaplan, W., Advanced Calculus, Addison-Wesley Publishing Co., 1953.
$$W = H(p_1, p_2, \ldots, p_m) + \lambda \Big[ \sum_i p_i - 1 \Big] \tag{47}$$

or

$$W = -\sum_i p_i \log p_i + \lambda \Big[ \sum_i p_i - 1 \Big]. \tag{48}$$

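Taking partial derivatives of W with respect to $p_i$,

$$\frac{\partial W}{\partial p_i} = -p_i \frac{\partial}{\partial p_i} \log p_i - \log p_i + \lambda, \quad \text{for } i = 1, 2, \ldots, m. \tag{49}$$

Setting this equal to zero to obtain the extreme point and
solving for $\log p_i$ (with natural logarithms, so that the first
term on the right of (49) equals $-1$),

$$\log p_i = \lambda - 1 \tag{50}$$

or

$$p_i = \exp(\lambda - 1). \tag{51}$$

Summing both sides of (51) on i and solving for $\exp(\lambda - 1)$,

$$\exp(\lambda - 1) = \frac{1}{m}. \tag{52}$$

Substituting (52) into (51),

$$p_i = \frac{1}{m}. \tag{53}$$

Thus the probability distribution for which the average
uncertainty is the maximum is the uniform distribution, or

$$H(p_1, p_2, \ldots, p_m) \le H(1/m, \ldots, 1/m) \tag{54}$$

with equality when $p_i = 1/m$.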
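As a numerical spot check of (54), which is not a proof, the sketch below compares the uncertainty of the uniform distribution with that of many randomly generated distributions on the same number of events. The number of events m = 5, the seed, and the helper names are arbitrary assumptions made for illustration.

```python
import math
import random

def H(probs):
    """Average uncertainty -sum p log2 p in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def random_distribution(m, rng):
    """A random probability distribution on m events (normalized random weights)."""
    weights = [rng.random() for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]

m = 5
rng = random.Random(0)                  # fixed seed so the run is repeatable
uniform_H = H([1 / m] * m)              # right side of (54); equals log2(m)
samples = [H(random_distribution(m, rng)) for _ in range(1000)]

print(uniform_H, max(samples))
```

None of the sampled distributions exceeds the uniform value, which is consistent with (54).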

If we have two sets, one with m events, the other with
m + 1 events, the maximum average uncertainty of the smaller
set is less than the maximum average uncertainty of the
larger one. This can be seen by computing and noting the
right side of (54) for both cases.

$$-\sum_{i=1}^{m} \frac{1}{m} \log \frac{1}{m} = -\log \frac{1}{m} = \log m < \log (m + 1). \tag{55}$$

Thus

$$H\Big(\frac{1}{m}, \ldots, \frac{1}{m}\Big) < H\Big(\frac{1}{m+1}, \ldots, \frac{1}{m+1}\Big). \tag{56}$$


Consider the form of H when one event, say m, is composed
of two sub-events such that $p_m = q_1 + q_2$.

$$H(p_1, p_2, \ldots, p_{m-1}, q_1, q_2) = -\sum_{i=1}^{m-1} p_i \log p_i - q_1 \log q_1 - q_2 \log q_2 \tag{57}$$

and

$$H(p_1, p_2, \ldots, p_m) = -\sum_{i=1}^{m-1} p_i \log p_i - p_m \log p_m. \tag{58}$$


Subtracting (58) from (57) and combining the logarithms of
the terms on the extreme right,

$$H(p_1, p_2, \ldots, p_{m-1}, q_1, q_2) = H(p_1, p_2, \ldots, p_m) + p_m\, H\Big(\frac{q_1}{p_m}, \frac{q_2}{p_m}\Big). \tag{59}$$
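Property (59) can be checked on a small example. The probabilities below are arbitrary values chosen for illustration, with the last event of probability 0.2 split into sub-events of probability 0.15 and 0.05; the helper name `H` and the use of base two logarithms are likewise assumptions.

```python
import math

def H(probs):
    """Average uncertainty -sum p log2 p in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.3, 0.2]          # p_1, p_2, p_m
q1, q2 = 0.15, 0.05          # sub-events of event m, with q1 + q2 = p_m
p_m = p[-1]

lhs = H(p[:-1] + [q1, q2])                       # left side of (59), as in (57)
rhs = H(p) + p_m * H([q1 / p_m, q2 / p_m])       # right side of (59)

print(lhs, rhs)
```

The two sides agree to rounding error: splitting an event adds the uncertainty of the split, weighted by the probability of the event that was split.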


We had previously stated that H was descriptive of its
set of arguments but not that it was unique. Shannon [12]
shows that the only function continuous in $p_i$ and possessing
properties (56) and (59) is $-\lambda \sum_i p_i \log p_i$, where $\lambda$ is an
arbitrary constant. Khinchin [6] shows the same result
using properties (44), (46) and (54).

For applications of this statistic the reader is referred
to the bibliography.